perm filename SPELL[1,RWF] blob sn#723741 filedate 1983-09-27 generic text, type C, neo UTF8
COMMENT āŠ—   VALID 00003 PAGES
C REC  PAGE   DESCRIPTION
C00001 00001
C00002 00002			Dynamic Programming/Spelling Checker
C00008 00003
C00011 ENDMK
CāŠ—;
		Dynamic Programming/Spelling Checker

Increasingly, people are using computers as an aid to good writing.  With a
good text-editing program, the writer can by a few keystrokes change a name
throughout a manuscript, can move paragraphs around without scissors and
paste, can indent and center and justify effortlessly.  One valuable
component of a good text editor is a spelling checker, which warns the 
writer of words not found in the dictionary.  If the spelling checker can
suggest a similar word as the correct spelling, such as ``fluorine'' for
the author's ``flourine'', or ``minuscule'' for ``miniscule,'' it may help
even more.  To find a similar word, a spelling checker must have a quantitative
measure of the difference between words.  A popular measure is to count the
smallest number of changes that will turn one word into the other.

The allowed changes are usually four:

     (1) Deletion of a letter: ``sherbert'' becomes ``sherbet.''
     (2) Insertion of a letter: ``curiculum'' becomes ``curriculum.''
     (3) Transposition of adjacent letters: ``niether'' becomes ``neither.''
     (4) Replacement of a letter: ``defence'' becomes ``defense.''

It is easy to design a program to test if two words differ by a single change,
but to find the similarities between Wednesday and Whensdi, Wooster and Worcester,
Chumley and Chaulmondeley, takes work.  The best method known iterates over all
the prefixes of the two words, starting with the shortest ones, and calculates
the minimum number of changes needed to change the one into the other; these
differences are stored in a table, for later use in dealing with longer prefixes.
The last ``prefixes'' looked at are the complete words themselves.

Let`s look at a concrete example, the words ``warranty'' and ``guarantee.''
Starting with the prefixes of length one, ``w'' becomes ``g'' by a single
letter replacement, so that distance is 1.  For ``w'' to become ``gu'' requires
more than one operation; if we suspect that the last operation is the addition
of the final ``u,'' ``w'' must first have been changed into ``g,'' we already
know that requires one operation, so we have found a way to change ``w'' to
``gu'' in two steps.  Each other way of getting the ``u'' has to be tried
before confirming that two steps is the least possible.


	G   U   A   R   A   N   T   E   E

    0   1   2   3   4   5   6   7   8   9

W   1   1   2   3   4   5   6   7   8   9

A   2   2   2   2   3   4   5   6   7   8

R   3   3   3   3   3   4   5   6   7   8

R   4   4   4   4   3   4   5   6   7   8

A   5   5   5   5   4   3   4   5   6   7

N   6   6   6   6   5   5   3   4   5   6

T   7   7   7   7   6   6   4   3   4   5

Y   8   8   8   8   7   7   5   4   4   5


where the marked path corresponds to the sequence

	WARRANTY
	GARRANTY
	GUARRANTY
	GUARANTY
	GUARANTEY
	GUARANTEE

Here is the algorithm in Pascal.  The two words are in character arrays
W1 and W2; their lengths are in W1L and W2L.

	FOR I:= 0 TO W1L DO
		TABLE[I,0]:= I;

	FOR J:= 1 TO W2L DO
		TABLE[0,J]:= J;

	FOR I:= 1 to W1L DO
		For J:=1 TO W2L DO
			BEGIN
			IF W1[I]=W2[J] THEN BEST:=TABLE[I-1,J-1]
			ELSE BEST:=TABLE[I-1,J-1]+1;

			IF TABLE[I-1,J]<BEST
			THEN BEST:= TABLE[I-1,J]+1;

			IF TABLE[I,J-1]<BEST
			THEN BEST:=TABLE[I,J-1]+1;
			IF(I>1) AND (J>1) AND (W1[I-1]=W2[J] AND (W1[I]=W2[J-1])
			AND (TABLE [I-1,J-1]<BEST)
				THEN BEST:=TABLE[I-1,J-1]+1;
			TABLE[I,J]:=BEST
			END

		WRITE (W1,W2, TABLE[W1L,W2L])

At a certain stage, we have partly filled the table, as shown below:


	G   U   A   R   A   N   T   E   E
    0   1   2   3   4   5   6   7   8   9 
W   1   1   2   3   4   5   6   7   8   9
A   2   2   2   2   3   4
R   3
R   4
A   5
N   6
T   7
Y   8

Wanting the distance from ``wa'' to ``guaran,'' we try the possibilities.

   (1) The final ``a'' was deleted, after changing ``w'' to ``guaran'' in
       6 steps; total, 7.

   (2) The final ``n'' was inserted, after changing ``wa'' to ``guara'' in
       4 steps; total, 5.

   (3) Transposition does not change ``wa'' to ``an'', so we drop that approach.

   (4) Replacement of ``a'' by ``n,'' after changing ``w'' to ``guara'' in
       5 steps; total, 6.

Insertion yields the shortest solution at 5 steps, so we enter 5 into the next
vacancy in the table.  When the table is completed, it is:

*****************
(page missing)

The underlying method of this algorithm finds the best way to accomplish a
task by finding and recording the best way to accomplish a modest number of 
partial tasks, at least one of which has to be done along the way, and then
trying all ways that the complete task can be a one-step continuation of
the partial task.  The method is called dynamic programming. 
************[references]

Exercise: Program the difference calculation taking into account estimates of
probability of each error.  Assume a replacement is much more likely for
letters adjacent on the keyboard.  Assume insertion and deletion are more
likely for a doubled letter, or for a vowel next to another vowel.  Assess
a cost for each change, roughly proportional to the logarithm of its estimated
likelihood.